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Abstract 



We present a technique for constructing random fields from a set of training samples. The 
learning paradigm builds increasingly complex fields by allowing potential functions, or 
features, that are supported by increasingly large subgraphs. Each feature has a weight 
that is trained by minimizing the Kullback-Leibler divergence between the model and the 
empirical distribution of the training data. A greedy algorithm determines how features 
are incrementally added to the field and an iterative scaling algorithm is used to estimate 
the optimal values of the weights. 

The random field models and techniques introduced in this paper differ from those common 
to much of the computer vision literature in that the underlying random fields are non- 
Markovian and have a large number of parameters that must be estimated. Relations to 
other learning approaches including decision trees and Boltzmann machines are given. As 
a demonstration of the method, we describe its application to the problem of automatic 
word classification in natural language processing. 



1. Introduction 



In this paper we present a method for incrementally constructing random fields. Our 
method builds increasingly complex fields to approximate the empirical distribution of a 
set of training examples by allowing potential functions, or features, that are supported 
by increasingly large subgraphs. Each feature is assigned a weight, and the weights are 
trained to minimize the Kullback-Leibler divergence between the field and the empirical 
distribution of the training data. Features are incrementally added to a field using a 
top-down greedy algorithm. 

To illustrate the nature of our approach, suppose that we have a set of images we 
wish to characterize by a statistical model. Each image is represented by an assignment of 
one of the colors red, blue, or green to the vertices of a square, 2-dimensional grid. How 
should the statistical model be constructed? 

To begin, suppose we observe that vertices of the images are 50% red, 30% blue and 
20% green. This leads us to characterize the statistical model in terms of the average 
number of vertices of each color, in a distribution of the form 



where u>i is the color of vertex i in the image u. The weights A r , A&, \ g are chosen to reflect 
our observations of the frequencies of colors of individual vertices in the set of images. The 
more detailed observation that a red vertex is only rarely adjacent to a green vertex might 
then be included as a refinement to the model, leading to a distribution of the form 



where the weight A rjS is adjusted to reflect our specific observations on the colors of adjacent 
vertices, and any necessary readjustments are made to the weights A r , A&, \ g to respect our 
earlier observations. At the expense of an increasing number of parameters that need to 
be adjusted, an increasingly detailed set of features of the images can be characterized 
by the distribution. But which features should the model characterize, and how should 
the weights be chosen? In this paper we present a general framework for addressing these 
questions. 

As another illustration, suppose we wish to automatically characterize spellings of 
words according to a statistical model; this is the application we develop in Section 5. A 
field with no features is simply a uniform distribution on ASCII strings (where we take the 
distribution of string lengths as given). The most conspicuous feature of English spellings 
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is that they are most commonly comprised of lower-case letters. The induction algorithm 
makes this observation by first constructing the field 



where x 1S an indicator function and the weight A[ a _ z j associated with the feature that a 
character is lower-case is chosen to be approximately 1.944. This means that a string with 
a lowercase letter in some position is about 7 ~ e 1 ' 944 times more likely than the same 
string without a lowercase letter in that position. The following collection of strings was 
generated from the resulting field by Gibbs sampling: 



m, r, xevo, ijjiir, b, to, jz, gsr, wq, vf , x, ga, msmGh, 
pep, d, oziVlal, hzagh, yzop, io, advzmxnv, ijv_bolft, x, 
emx, kayerf , mlj , rawzyb, jp, ag, ctdnnnbg, wgdw, t, kguv, 
cy, spxcq, uzflbbf, dxtkkn, cxwx, jpd, ztzh, lv, zhpkvnu, 
1~, r, qee, nynrx, atze4n, ik, se, w, lrh, hp+, yrqyka'h, 
zengotenx, igcump, zjcjs, lqpWiqu, cefmfhc, o, lb, fdcY, tzby, 
yopxmvk, by, fz,, t, govyeem, ijyiduwfzo, 6xr, duh, ejv, pk, 
pjw, 1, fl, w 



The second most important feature, according to the algorithm, is that two adjacent lower- 
case characters are extremely common. Accordingly, the second-order field becomes 

p l( u -j _ J_gS;~j A [a-z][a-z]X[a-z](^)X[a-z](wj) + ^ ; A [a _ z] X[a-z](^') 

where the weight A[ a _ z j[ a _ z ] associated with adjacent lower-case letters is approximately 
1.80. 

The first 1000 features that the algorithm induces include the strings s>, <re, ly>, 
and ing>, where the character "<" denotes beginning-of-string and the character ">" de- 
notes end-of-string. In addition, the first 1000 features include the regular expressions 
[0-9] [0-9] (with weight 9.15) and [a-z] [A-Z] (with weight -5.81) in addition to the 
first two features [a-z] and [a-z] [a-z] . A set of strings obtained by Gibbs sampling 
from the resulting field is shown below: 
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was, reaser, in, there, to, will, ,, was, by, homes, thing, 
be, reloverated, ther, which, conists, at, fores, anditing, with, 
Mr., proveral , the, ,, ***, on't, prolling, prothere, ,, mento, 
at, yaou, 1, chestraing, for, have, to, intrally, of, qut , ., 
best, compers, ***, cluseliment, uster, of, is, deveral , this, 
thise, of, of feet, inatever, thifer, constranded, stater, vill, 
in, thase, in, youse, menttering, and, ., of, in, verate, of, to 

These examples are discussed in detail in Section 5. 

The induction algorithm that we present has two parts: feature selection and param- 
eter estimation. The greediness of the algorithm arises in feature selection. In this step 
each feature in a pool of candidate features is evaluated by estimating the reduction in the 
Kullback-Leibler divergence that would result from adding the feature to the field. This 
reduction is approximated as a function of a single parameter, and the largest value of this 
function is called the gain of the candidate. The candidate with the largest gain is added 
to the field. In the parameter estimation step, the parameters of the field are estimated 
using an iterative scaling algorithm. The algorithm we use is a new statistical estimation 
algorithm that we call Improved Iterative Scaling. It is an improvement of the Generalized 
Iterative Scaling algorithm of Darroch and Rat cliff [19] in that it does not require that 
the features sum to a constant. The improved algorithm is easier to implement than the 
Darroch and Rat cliff algorithm, and can lead to an increase in the rate of convergence by 
increasing the size of the step taken toward the maximum at each iteration. In Section 4 
we give a simple, self-contained proof of the convergence of the improved algorithm that 
does not make use of the Kuhn- Tucker theorem or other machinery of constrained opti- 
mization. Moreover, our proof does not rely on the convergence of alternating I-projection 
as in Csiszar's proof [16] of the Darroch- Rat cliff procedure. 

Both the feature selection step and the parameter estimation step require the solution 
of certain algebraic equations whose coefficients are determined as expectation values with 
respect to the field. In many applications these expectations cannot be computed exactly 
because they involve a sum over an exponentially large number of configurations. This 
is true of the application that we develop in Section 5. In such cases it is possible to 
approximate the equations that must be solved using Monte Carlo techniques to compute 
expectations of random variables. The application that we present uses Gibbs sampling to 
compute expectations, and the resulting equations are then solved using Newton's method. 
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Our method can be viewed in terms of the principle of maximum entropy [26], which 
instructs us to assume an exponential form for our distributions, with the parameters 
viewed as Lagrange multipliers. The techniques that we develop in this paper apply to 
exponential models in general. We formulate our approach in terms of random fields 
because this provides a convenient framework within which to work, and because our main 
application is naturally cast in these terms. 

Our method differs from the most common applications of statistical techniques in 
computer vision and natural language processing. In contrast to many applications in 
computer vision, which involve only a few free parameters, the typical application of our 
method involves the estimation of thousands of free parameters. In addition, our methods 
apply to general exponential models and random fields-there is no underlying Markov 
assumption made. In contrast to the statistical techniques common to natural language 
processing, in typical applications of our method there is no probabilistic finite-state or 
push-down automaton on which the statistical model is built. 

In the following section we describe the form of the random field models considered in 
this paper and the general learning algorithm. In Section 3 we discuss the feature selection 
step of the algorithm and briefly address cases when the equations need to be estimated 
using Monte Carlo methods. In Section 4 we present the Improved Iterative Scaling al- 
gorithm for estimating the parameters, and prove the convergence of this algorithm. In 
Section 5 we present the application of inducing features of spellings, and finally in Sec- 
tion 6 we discuss the relation between our methods and other learning approaches, as well 
as possible extensions of our method. 

2. The Learning Paradigm 

In this section we present the basic algorithm for building up a random field from elemen- 
tary features. The basic idea is to incrementally construct an increasingly detailed field 
to approximate a reference distribution p. Typically the distribution p is obtained as the 
empirical distribution of a set of training examples. After establishing our notation and 
defining the form of the random field models we consider, we present the training problem 
as a statement of two equivalent optimization problems. We then discuss the notions of a 
candidate feature and the gain of a candidate. Finally, we give a statement of the induction 
algorithm. 

2.1 Form of the random field models. Let G = (E } V) be a finite graph with vertex set V 
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and edge set E, and let A be a finite alphabet. The configuration space fi is the set of 
all labelings of the vertices in V by letters in A. If C C V and u £ fi is a configuration, 
then cue* denotes the configuration restricted to C . A random field on G is a probability 
distribution on fi. The set of all random fields is nothing more than the simplex A of all 
probability distributions on fi. If / : fi — > R then the support of /, written supp(/), is the 
smallest vertex subset C C V having the property that whenever £ fi with uq = <jo' c 

then f{u) = f(u'). 

We consider random fields that are given by Gibbs distributions of the form 

p{u) = -e^c Vc ^ (2.1) 
Z 

for io £ fi, where Vc : fi — > R are functions with supp(Vc) = C . The field is Markov if 
whenever Vc 7^ then C is a clique, or totally connected subset of V . This property is 
expressed in terms of conditional probabilities as 

p(u u \ u v , v ^ u) = p(u u \ u v , (u,v) £ E) (2.2) 

where u and v are arbitrary vertices. We assume that each C is a path-connected subset 
of V and that 

Vc(u)= ]T a?/?h = a c ./ c h (2.3) 

where \f £ R and ff (u>) £ {0, 1}. We say that the values \f are the parameters of the 
field and that the functions ff are the features of the field. In the following, it will often be 
convenient to use notation that disregards the dependence of the features and parameters 
on a vertex subset C, expressing the field in the form 

p(cu) = I e E i ^^(-) = I e A-/M. (2 . 4) 
Zi Zi 

For every random field (E } V } {A^, fi}) of the above form, there is a field (E' , V } {A^, fi}) 
that is Markovian, obtained by completing the edge set E to ensure that for each i } the 
subgraph generated by the vertex subset C = supp(/i) is totally connected. 

If we impose the constraint A^ = \j on two parameters A^ and Aj, then we say that 
these parameters are tied. If A^ and Aj are tied, then we can write 

\ z f z (u) + \ J f J (u) = \g(u) (2.5) 

where g = fi + fj is a non-binary feature. In general, we can collapse any number of 
tied parameters onto a single parameter associated with a non-binary feature. Having 
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tied parameters is often natural for a particular problem, but the presence of non-binary 
features generally makes the estimation of parameters more difficult. 

A random field (E } V } {A^, fi}) is said to have homogeneous features if for each fea- 
ture fi and automorphism a of the graph G = (E } V) } there is a feature fj such that 
fj(cruj) = fi(io) for u £ fi. If in addition Aj = A^, then the field is said to be homogeneous . 
Homogeneous features arise naturally in the application of Section 5. 

The methods that we describe in this paper apply to exponential models in general; 
that is, it is not essential that there is an underlying graph structure. However, it will be 
convenient to express our approach in terms of the random field models described above. 

2.2 Two optimization problems. Suppose that we are given an initial model qo £ A, a 
reference distribution p, and a set of features / = (/o,/i, •••,/«)• I n practice, it is often 
the case that p is the empirical distribution of a set of training samples a/ 1 -*, a/ 2 -* . . .u^ N \ 
and is thus given by 

PH = -Jf- (2-6) 

where c(to) = X]i<i<jv %! w^) is the number of times that configuration to appears 
among the training samples. 

We wish to construct a probability distribution q* £ A that accounts for these data, in 
the sense that it approximates p but does not deviate too far from qo. We measure distance 
between probability distributions p and q in A using the Kullback-Leibler divergence 

D(p||p) = ]Tp»log^M. (2.7) 
Throughout this paper we use the notation 

p[g] = g(v)p(u) 

for the expectation of a function g : fi — > R with respect to the probability distribution p. 
For a function h : fi — > R and a distribution q, we use both the notation hoq and q^ to 
denote the generalized Gibbs distribution given by 

q h (u) = (hoq)(u) = ^- ) e h ^q(u). 

Note that Z q {h) is not the usual partition function. It is a normalization constant de- 
termined by the requirement that (hoq)(u) sums to 1 over u, and can be written as an 
expectation: 

Z q (h)=q[e h ]. 
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There are two natural sets of probability distributions determined by the data p, go, 
and /. The first is the set V(f,p) of all distributions that agree with p as to the expected 
value of the feature function /: 

V(f,p) = { P eA: p[f]=p[f]}. 

The second is the set Q(f } qo) of generalized Gibbs distributions based on qo with feature 
function /: 

Q(f,qo) = {(X-f)oq : A £ R n } . 

We let qo) denote the closure of qo) in A (with respect to the topology it inherits 
clS cl subset of Euclidean space). 

There are two natural criteria for choosing q+: 

• Maximum Likelihood Gibbs Distribution. Choose q* to be a distribution in Q(f } qo) 
with maximum likelihood with respect to p: 

q± = argmin D(p || q) 
?eQ(/,?o) 

• Maximum Entropy Constrained Distribution. Choose q* to be a distribution in V(f,p) 
that has maximum entropy relative to qo: 

q± = argminD(p|| qo) 

peT(f,p) 

Although these criteria are different, they determine the same distribution. In fact, the 
following is true, as we prove in Section 4. 

Proposition. Suppose that D(p || qo) < oo. Then there exists a unique q± £ A satisfying 

(1) q*eV(f,p)nQ(f,q ) 

(2) D(p || q) = D(p || q+) + D(q* || q) for any p £ V(f,p) and q £ Q(f } q ) 

(3) q± = argmin D(p || q) 

?es(/,?o) 

(4) q± = argmin D(p \\ qo). 

peT(f,p) 

Moreover, any of these four properties determines q± uniquely. 

When p is the empirical distribution of a set of training examples a/ 1 -*, a/ 2 -* . . .u^ N \ 
minimizing D(p || p) is equivalent to maximizing the probability that the field p assigns to 
the training data, given by 

H p(J») = n^) cM « e~ ND( P^K (2.8) 

l<i<JV u£S! 
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With sufficiently many parameters it is a simple matter to construct a field for which 
D(p || p) is arbitrarily small. In fact, we can construct a field with iV + 1 features and small 
Kullback-Leibler divergence with respect to p by taking 

fi(u) = 6(u;,u;W), \i = \ogc(u;W) (2.9) 

for 1 < i < N and 

/jv+iM= J] (WiM), A JV+1 «-1. (2.10) 

l<z<JV 

While such a model has small divergence with respect to the empirical distribution of the 
samples it does not generalize to other, previously unseen configurations. This is the 
classic problem of over-training. To avoid this problem we seek to incrementally construct 
a field that captures the salient properties of p by incorporating an increasingly detailed 
collection of features. This motivates the random field induction paradigm that we now 
present. 

2.3 Inducing field interactions. We begin by supposing that we have a set of atomic features 

Atomic C {g : O — y {0, 1}, supp(g) = v g G V} 

each of which is supported by a single vertex. We use atomic features to incrementally 
build up more complicated features. The following definition specifies how we shall allow 
a field to be incrementally constructed, or induced. 

Definition 2.1. Suppose that the held q is given by q = (A • /) oq . The features fi are 
called the active features of q. A feature g is a candidate for q if either g G ^atomic, or if 
g is of the form g(io) = a(u)fi(u) for an atomic feature a and an active feature fi with 
supp(g) supp(/i) G E. The set of candidate features of q is denoted C(q). 

In other words, candidate features are obtained by conjoining atomic features with existing 
features. The condition on supports ensures that each feature is supported by a path- 
connected subset of G. As an illustration, the figure below shows a situation in which the 
underlying graph G is a grid, and a feature is supported by five vertices. The dashed lines 



8 



indicate edges that would need to be present for the underlying field to be Markovian. 



Figure 1 



If g G C(q) is a candidate feature of q, then we call the 1-parameter family of random 
fields q ag = (ag)oq the induction of q by g. We also define 

G q (a,g) = D(p\\q) - D(p\\q ag ) . (2.11) 

We think of G q (a } g) as the improvement that feature g brings to the model when it has 
weight a. As we show in the following section, G q (a } g) is H-convex in a. We define G q (g) 
to be the greatest improvement that feature g can give to the model while keeping all of 
the other features' parameters fixed: 

G q (g) = sup G q (a,g) . (2.12) 

a 

We refer to G q (g) as the gain of the candidate g. 



2.4 Incremental construction of random fields. We can now describe our algorithm for 
incrementally constructing fields. 

Field Induction Algorithm. 

Initial Data: 

A reference distribution p and an initial model qo. 
Output: 

A Held q± with active features /o, . . . , /jv such that q± = arg min D(p || q). 

?es(/,?o) 
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Algorithm: 

(0) Set = g . 

(1) For each candidate g £ C(q ( - n ^) compute the gain G q ( n )(g). 

(2) Let f n = argmax G q (n)(g) he the feature with the largest gain. 



9 ec(?(")) 



(3) Compute q± = argmin D(p \\ q), where f = (/ , /i, • • • , f n )- 



?€S(/,? ) 



(4) Set ^f( n + 1 ) = q+ and n n + 1, and go to step (1). 



This induction algorithm has two parts: feature selection and parameter estimation. 
Feature selection is carried out in steps (1) and (2), where the feature yielding the largest 
gain is incorporated into the model. Parameter estimation is carried out in step (3), 
where the parameters are adjusted to best represent the reference distribution. These two 
computations are discussed in more detail in the following two sections. 



The feature selection step of our induction algorithm is based upon an approximation. We 
assume that we can estimate the improvement due to adding a single feature, measured by 
the reduction in Kullback-Leibler divergence, by adjusting only the weight of the feature 
and keeping all of the other parameters of the field fixed. In general this is only an estimate, 
and it may well be that adding a feature will require significant adjustments to all of 
the parameters in the new model. From a computational perspective, approximating the 
improvement in this way can enable the simultaneous evaluation of thousands of candidate 
features, and makes the algorithm practical. In this section we present further detail on 
the feature selection step. 

Proposition 3.1. Let G q (a,g), defined in (2.11), he the approximate improvement ob- 
tained by adding feature g with parameter a to the field q. Then if g is not constant, 
G q (ct,g) is strictly D-convex in a and attains its maximum at the unique point a satisfy- 



3. Feature Selection 



mg 



P[g] = Q&g [g] ■ 
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Proof. Using the definition (2.7) of the Kullback-Leibler divergence we can write 

^ Z- 1 (ag)e a 3(^q(u) 
G q (a,g)= ) p(u)log — 

= ^pH(^H-i og(Z [ e ^]) 

= <*p[g] - logq [e ag ] . 



f3.2) 



Thus 



G q (a,g) = p[g\ 



M. 



oreover, 



da " v q[e a 9] 

= P[g] ~ Qag [g] ■ 

d 2 = q[ge a °} 2 q[g 2 e^} 

da 2 q[a}9) q[e a 9]2 q [ e *9] 

= -q ag [{g - q ag [g]f] 

Hence, -J^Gq^a, g) < 0, so that G q (a,g) is H-convex in a. If g is not constant, then 
-J^2G q (ct, g), which is minus the variance of g with respect to q ag , is strictly negative, so 
that G q (a } g) is strictly convex. □ 

When g is binary- valued, its gain can be expressed in a particularly nice form. This 
is stated in the following proposition, whose proof is a simple calculation. 

Proposition 3.2. Suppose that the candidate g is binary-valued. Then G q (a } g) is max- 
imized at 

~ , f p[g](i-q[g}) \ ^ 

\q[g]{l -p[g])J 

and at this value, 

G q (g) = G q (a,g) = D(B p \\B q ) (3.4) 
where B p and B q are Bernoulli random variables given by 

B p (l)=p[g] B p (0) = l-p[g] 
B q {l)=q[g] B q (0) = l-q[g}. 

For features that are not binary- valued, but instead take values in the positive integers, 
the parameter a that solves (3.1) and thus maximizes G q (a } g) cannot, in general, be 
determined in closed form. This is the case for tied binary features, and it applies to 
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the application we describe in Section 5. For these cases it is convenient to rewrite (3.1) 
slightly. Let f3 = e a so that d/da = fid/dfi. Let 

9k = *Ylq(v)S(k,g(u)) (3.6) 

id 

be the total probabilty assigned to the event that the feature g takes the value k. Then 
(3.1) becomes 

[3^- G g (log /?, g) = p[g] - ^°***f* = (3.7) 

This equation lends itself well to numerical solution. The general shape of the curve 
(3 i — y fid/dfi G q (\og (3,g) is shown in the figure below. 



Figure 2 

The limiting value of (3dG q (log (3 } g)/d(3 as (3 — > oo is N — p[g]. If N — p[g] < then there 
is no solution to equation (3.7). Otherwise, the solution can be found using Newton's 
method, which in practice converges rapidly for such functions. 

When the configuration space is large, so that the coefficients cannot be calculated 
by summing over all configurations, Monte Carlo techniques may be used to estimate them. 
It is important to emphasize that the same set of random configurations can be used to 
estimate the coefficients g^ for each candidate g simultaneously. Rather than discuss the 
details of Monte Carlo techniques for this problem we refer to the extensive literature on 
this topic. We have obtained good results using the standard technique of Gibbs sampling 
[25] for the problem we describe in Section 5. 



4. Parameter Estimation 

In this section we present an algorithm for selecting the parameters associated with the 
features of a random field. The algorithm is closely related to the Generalized Iterative 
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Scaling algorithm of Darroch and Rat cliff [19]. Like the Darroch and Rat cliff procedure, 
the algorithm requires that the features fi are non-negative: fi(io) > for all u £ fi. 
Unlike the Darroch and Rat cliff procedure, however, our method does not require the 
features to be normalized to sum to a constant. 

Throughout this section we hold the set of features / = (/o, fi , . . . , f n ) } the initial 
model qo and the reference distribution p fixed, and we simplify the notation accordingly. 
In particular, we write 705 instead of (7 • /) o q for 7 £ R n . We assume that qo(u) = 
whenever p(u) = 0. This condition is commonly written p <C qo } and it is equivalent to 
D(p || qo) < 00. 

A description of the algorithm requires an additional piece of notation. Let 



i=0 

If the features are binary, then f#(uj) is the total number of features that are "on" for the 
configuration u. 

Improved Iterative Scaling. 



Initial Data: 

A reference distribution p and an initial model qo, with p <^ qo, and 
non-negative features /o, fi , . . . , f n . 

Output: 

The distribution q± = argmin D(p || q) 



In other words, this algorithm constructs a distribution q* = lim^^oo 7^ o q where 7^ = 



n 






J^_ 7j and 7^ is determined as the solution to the equation 
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When used in the n-th iteration of the field induction algorithm, where a candidate feature 
g = f n is added to the field q 

— Qm we choose the initial distribution qo to be qo — Qagi 
where a is the parameter that maximizes the gain of g. In practice, this provides a good 
starting point from which to begin iterative scaling. In fact, we can view this distribution 
as the result of applying one iteration of an Iterative Proportional Fitting Procedure [8,15] 
to project q ag onto the linear family of distributions with (/-marginals constrained to p[g]. 
Our main result in this section is 

Proposition 4.1. Suppose is the sequence in A determined by the Improved Iterative 

Scaling algorithm. Then D(p || q^) decreases monotonically to D(p || q+) and q^ converges 

to q± = arg min D(p || q) = arg min D(p || qo). 
q£Q pev 

In the remainder of this section we present a self-contained proof of the convergence of 
the algorithm. The key idea of the proof is to express the incremental step of the algorithm 
in terms of an auxiliary function which bounds from below the likelihood objective function. 
This technique is the standard means of analyzing the EM algorithm [21], but it has not 
previously been applied to iterative scaling. Our analysis of iterative scaling is different 
and simpler than previous treatments. In particular, in contrast to Csiszar's proof of the 
Darroch- Rat cliff procedure [16], our proof does not rely upon the convergence of alternating 
I-projection [15]. 

We begin by proving the basic duality theorem which states that the maximum like- 
lihood problem for a Gibbs distribution and the maximum entropy problem subject to 
linear constraints have the same solution. We then turn to the task of computing this 
solution. After introducing auxiliary functions in a general setting, we apply this method 
to prove convergence of the Improved Iterative Scaling algorithm. We finish the section 
by discussing Monte Carlo methods for estimating the equations when the size of the 
configuration space prevents the explicit calculation of feature expectations. 

4-1 Duality. In this section we prove 

Proposition 4.2. Suppose that p <^ qo- Then there exists a unique q± £ A satisfying 

(1) £ vn Q 

(2) D(p || q) = D(p || q-k) + D(q± || q) for any p £ V and q £ Q 

(3) q-k = arg min D(p || q) 

(4) q± = arg min D(p || qo). 

p£V 

Moreover, any of these four properties determines q± uniquely. 
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This result is well known, although perhaps not quite in this packaging. In the lan- 
guage of constrained optimization, it expresses the fact that the maximum likelihood prob- 
lem for Gibbs distributions is the convex dual to the maximum entropy problem for linear 
constraints. We include a proof here to make this paper self-contained and also to carefully 
address the technical issues arising from the fact that Q is not closed. The proposition 
would not be true if we replaced Q with Q. In fact, VOQ might be empty. Our proof is ele- 
mentary and does not rely on the Kuhn- Tucker theorem or other machinery of constrained 
optimization. 

Our proof of the proposition will use a few lemmas. The first two lemmas we state 
without proof. 

Lemma 4.3. 

(1) D(p || q) is a non-negative, extended real-valued function on A X A. 

(2) D(p || q) = if and only if p = q. 

(3) D(p || q) is strictly convex in p and q separately. 

(4) D(p || q) is C 1 in q. 

Lemma 4.4. 

(1) The map (7,p) i— >■ 7°p is smooth in (7,p) G R n X A. 

(2) The derivative of D(p\\ \oq) with respect to A is 

j t \ t=0 D(p\\(t\)oq) = \.(p[f]-q[f]). 
Lemma 4.5. If p <^ qo then V D Q is nonempty. 

Proof. Define q* by property (3) of Proposition 4.2; that is, q* = arg min gG g D(p } q)- To 
see that this makes sense, note that since p <^ go, D(p } q) is not identically oo on Q. Also, 
|| q) is continuous and strictly convex as a function of q. Thus, since Q is closed, 
D(p } q) attains its minimum at a unique point q* G Q. We will show that q* is also in V. 
Since Q is closed under the action of R n , A o q+ is in Q for any A. Thus by the definition of 
A = is a minimum of the function A — > D(p } Ao^). Taking derivatives with respect 
to A and using Lemma 4.4 we conclude q*[f] = p[f]- Thus q* G V. □ 

Lemma 4.6. If q* G V fl Q then for any p G V and q G Q 

D(p || q) = D(p || q+) + D(q* || q) . 

This is called the Pythagorean property since it resembles the Pythagorean theorem if we 
imagine that D(p || q) is the square of Euclidean distance and (p, q+, q) are the vertices of 
a right triangle. 
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Proof. A straightforward calculation shows that 

D( Pl || qi ) - D( Pl || q 2 ) - D(p 2 || qi ) + D(p 2 || q 2 ) = A • - p 2 [f]) 

for any pi , p 2 , 51 , q 2 G A with 52 = A o . It follows from this identity and the continuity 
of D that 

D( Pl || qi ) - D( Pl || q 2 ) - D(p 2 || qi ) + £>(p 2 || q 2 ) = 
if Pi 7 p 2 G "P and 51, 52 G Q. The lemma follows by taking p\ = q\ = q±. □ 

Proof of Proposition 4-2. Choose q* to be any point mVOQ. Such a q* exists by Lemma 
4.5. It satisfies property (1) by definition, and it satisfies property (2) by Lemma 4.6. As 
a consequence of property (2), it also satisfies properties (3) and (4). To check property 
(3), for instance, note that if q is any point in Q, then D(p || q) = D(p || q+) + D(q± || q) > 
D{q* || q). 

It remains to prove that each of the four properties (l)-(4) determines q* uniquely. 
In other words, we need to show that if m is any point in A satisfying any of the four 
properties (l)-(4), then m = q*. Suppose that m satisfies property (1). Then by property 
(2) for q* with p = q = m, D(m } m) = D(m } q+) + D(q± } m). Since D(m } m) = 0, it follows 
that Dim^q^) = so m = q*. If m satisfies property (2), then the same argument with q± 
and m reversed again proves that m = q*. Suppose that m satisfies property (3). Then 

D(p || q*) > D(p || m) = D(p || q+) + D(q* || m) 

where the second equality follows from property (2) for q*. Thus D(q± } m) < so m = q*. 
If m satisfies property (4), then a similar proof shows that once again m = q*. □ 

4-2 Auxiliary functions. In the previous section we proved the existence of a unique prob- 
ability distribution q* that is both a maximum likelihood Gibbs distributions and a maxi- 
mum entropy constrained distribution. We now turn to the task of computing q*. 
Fix p and let L : A — > R be the log-likelihood objective function 

L(q) = -D(p\\q). 

Definition 4.7. A function A : R n x A — > R is an auxiliary function for L if 
(1) For all q G A and 7 G R n 

L^oq) > L(q)+A(i,q) 
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(2) -A(7, q) is continuous in q £ A and C 1 in 7 £ R n with 

d d 
A(0,q)=0 and —\ t=0 A(t^,q) = —\ t=0 L((t^)oq). 

We can use an auxiliary function A to construct an iterative algorithm for maximizing 
L. We start with q^ = qo and recursively define q( k+1 ) by 

q (k+l) =1 (k) Qq (k) 7 (*) =argmax ^( 7j g(*)). 

7 

It is clear from property (1) of the definition that each step of this procedure increases L. 
The following proposition implies that in fact the sequence q^ will reach the maximum 
of L. 

Proposition 4.8. Suppose q^ is any sequence in A with 

qW=q and q( k +V = q™ 

where 7^ £ R n satisfies 

A(^ k \qW) = su P A( 7 ,gW). (4.4) 

7 

Then L(q^) increases monotonically to max L(q) and q^ converges to q± = argmax L(q). 

qeQ q( zQ 

Equation (4.4) assumes that the supremum sup 7 A(*y, q^ ) is achieved at finite 7. In the 
next section, under slightly stronger assumptions, we present a extension of Proposition 
4.8 that allows some components of 7^ to take the value —00. 

To use the proposition to construct a practical algorithm we must determine an aux- 
iliary function A( 7 , q) for which 7^ satisfying the required condition can be determined 
efficiently. In the Section 4.3 we present a choice of auxiliary function which yields the 
Improved Iterative Scaling updates. 

To prove Proposition 4.8 we first prove three lemmas. 

Lemma 4.9. If m is a cluster point of q^ k \ then A( 7 , m) < for all 7 £ R n . 
Proof. Let q( k ^ be a sub-sequence converging to m. Then for any 7 

A( 7 ,g(*')) < A(^ k '\q^) < + - L(q^) < L(q {kl +^) - L(q^) . 

The first inequality follows from property (4.4) of ^ nk \ The second and third inequalities 
are a consequence of the monotonicity of L(q^). The lemma follows by taking limits and 
using the fact that L and A are continuous. □ 
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Lemma 4.10. If m is a cluster point of q^ k \ then -j^\t=o L((t^j)om) = for any 7 G R n . 

Proof. By the previous lemma, A(*y, m) < for all 7. Since A(0, m) = 0, this means that 
7 = is a maximum of A(*y, m) so that 

= j t \t=o A(t~f, m) = I f=0 L((<7) o m) . 

□ 

Lemma 4.11. Suppose {q^} is any sequence with only one cluster point q*. Then q^ 
converges to q*. 

Proof. Suppose not. Then there exists an open set B containing q* and a subsequence 
q(n k ) ^ j$ Since A is compact, q^ nk ^ has a cluster point q^ B. This contradicts the 
assumption that {q^} has a unique cluster point. □ 

Proof of Proposition Suppose that m is a cluster point of q^ k K It follows from 

Lemma 4.10 that t =o L((t^j) oq) = 0, and so m G V D Q by Proposition 4.4. But q* is 
the only point in V D Q by Proposition 4.2. It follows from Lemma 4.11 that q^ converges 
to q±. □ 



4-3 Dealing with 00. In order to prove the convergence of the Improved Iterative Scaling 
algorithm, we need an extension of Proposition 4.8 that allows the components of 7 to 
equal —00. For this extension, we assume that all the components of the feature function 
/ are non-negative: 

fi(u) > for all i and all u. (4.5) 

Let R U — 00 denote the partially extended real numbers with the usual topology. The 
operations of addition and exponentiation extend continuously to R U —00. Let S be the 
open subset of (R U — 00) n X A defined by 

S = { (7, q) G (R U -oo) n X A : g(cu)e T/(u;) > for some u } 

Observe that R n X A is a dense subset of S. The map (7,5) 1— > 7°p, which up to this 
point we defined only for finite 7, extends uniquely to a continuous map from all of S to 
A. (The condition on (7,5) G S ensures that the normalization in the definition of 7 op is 
well defined, even if 7 is not finite.) 
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Definition 4.12. We call a function A : S — > R U — oo an extended auxiliary function for 
L if when restricted to R n X A it is an ordinary auxiliary function in the sense of Definition 
4. 7, and if, in addition, it satisfies property (1) of Definition 4. 7 for any (q, 7) £ S, even if 
7 is not finite. 

Note that if an ordinary auxiliary function extends to a continuous function on <S, then 
the extension is an extended auxiliary function. 

We have the following extension of Proposition 4.8: 

Proposition 4.8'. Suppose the feature function f satisfies the non-negativity condition 
(4.5) and suppose A is an extended auxiliary function for L. Then the conclusion of 
Proposition 4.8 continues to hold if the condition on 7^ is replaced by: 

(l (k \q (k) ) e S and A(j (k \q (k) ) > A(j,q (k) ) for any (7 , q (k) ) £ S . 

Proof. Lemma 4.9 is valid under the altered condition on 7^ since A(*y, q) satisfies prop- 
erty (1) of Definition 4.7 for all (7,5) £ S. As a consequence, Lemma 4.10 also is valid, 
and the proof of Proposition 4.8 goes through without change. □ 



4-4 Improved Iterative Scaling. We now prove the monotonicity and convergence of the 
Improved Iterative Scaling algorithm by applying Proposition 4.8 to a particular choice of 
auxiliary function. We continue to assume, as in the previous section, that each component 
of the feature function / is non-negative. 
For q £ A and 7 £ R n , define 

A( 7 , q) = 1 + 7 • p[f] - q(") E ft 1 I ") 6 ^ /#M 

id i 

where f(i \to) = j'Jfy ■ It is easy to check that A extends to a continuous function on 
(RU -oo) n x A. 

Lemma 4.13. A("f,q) is an extended auxiliary function for L(q). 

The key ingredient in the proof of the lemma is the H-convexity of the logarithm and the 
U-convexity of the exponential, as expressed in the inequalities 

e E,- tiai <J2^ e ai ^ U > with U = 1 ( 46 ) 

i i 

log x < x — 1 for all x > . (4-7) 
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Proof of Lemma J^.13. Because A extends to a continuous function on (R U — oo) n X A, it 
suffices to prove that it satisfies properties (1) and (2) of Definition 4.7. To prove property 

(1) note that 

L( 7 o q) - L(q) = 7 • p[f] - log ]T q(u) e ^(<"> (4.8) 
>7-p[/]+l-E9H er/M ( 49 ) 

> 7 • Pin + 1 - E E ^ i w ) e7i /#M ( 4 - 10 ) 

= A( 7 ,q). 

Equality (4.8) is a simple calculation. Inequality (4.9) follows from inequality (4.7). In- 
equality (4.10) follows from the definition of f # and Jensen's inequality (4.6). Property 

(2) of Definition 4.7 is straightforward to verify. □ 

Proposition 4.1 follows immediately from the above lemma and the extended Propo- 
sition 4.8. Indeed, it is easy to check that 7^ defined in Proposition 4.1 achieves the 
maximum of A(7,q r ^- ) ), so that it satisfies the condition of Proposition 4.8'. 

4-5 Monte Carlo methods. The Improved Iterative Scaling algorithm described above is 
well-suited to numerical techniques since all of the features take non-negative values. In 
each iteration of this algorithm it is necessary to solve a polynomial equation for each 
feature fi. That is, we can express equation (4.2) in the form 

M 

E a K ^ = 

m=0 

where M is the largest value of f#(uj) = fi(io) and 



E w 9 (fc) H/iH*K/#H) ™>o 

-p[fi] m = 



«S = < (4.11) 



( U\ . . . (. k ) . . 

where q ( ' is the field for the &-th iteration and (3i = e 7 *' . This equation has no solution 

(k) 

precisely when a m • = for m > 0. Otherwise, it can be efficiently solved using Newton's 

' . (k) 

method since all of the coefficients a m i , m > 0, are non-negative. When Monte Carlo 

' . . . (k) 

methods are to be used because the configuration space fi is large, the coefficients a m • 
can be simultaneously estimated for all i and m by generating a single set of samples from 
the distribution q^ k K 
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5. Application: Word Morphology 



Word clustering algorithms are useful for many natural language processing tasks. One 
such algorithm [10], called mutual information clustering, is based upon the construction 
of simple bigram language models using the maximum likelihood criterion. The algorithm 
gives a hierarchical binary classification of words that has been used for a variety of pur- 
poses, including the construction of decision tree language and parsing models [28], and 
sense disambiguation for machine translation [11]. 

A fundamental shortcoming of the mutual information word clustering algorithm given 
in [10] is that it takes as fundamental the word spellings themselves. This increases 
the severity of the problem of small counts that is present in virtually every statistical 
learning algorithm. For example, the word "Hamiltonianism" appears only once in the 
365,893,263-word corpus used to collect bigrams for the clustering experiments described 
in [10]. Clearly this is insufficient evidence on which to base a statistical clustering decision. 
The basic motivation behind the feature-based approach is that by querying features of 
spellings, a clustering algorithm could notice that such a word begins with a capital letter, 
ends in "ism" or contains "ian," and profit from how these features are used for other 
words in similar contexts. 

In this section we describe how we applied the random field induction algorithm to 
discover morphological features of words, and we present sample results. This technique 
was used in [27] to improve mutual information clustering. In Section 5.1 we formlate 
the problem in terms of the notation and results of Sections 2, 3, and 4. In Section 5.2 
we describe how the field induction algorithm is actually carried out in this application. 
In Section 5.3 we explain the results of the induction algorithm by presenting a series of 
examples. 

5.1 Problem formulation. To discover features of spellings we take as configuration space 
O = A* where A is the ASCII alphabet. We construct a probability distribution p(u) on 
O by first predicting the length \ u |, and then predicting the actual spelling; thus, p(u) = 
pi(\io \)p s (uj | \uj I) where p\ is the length distribution and p s is the spelling distribution. 
We take the length distribution as given. We model the spelling distribution p s (- \l) over 
strings of length I as a random field. Let be the configuration space of all ASCII strings 
of length Z. Then | | = O(10 2 ') since each Ui is an ASCII character. 

To reduce the number of parameters, we tie features so that a feature has the same 
weight independent of where it appears in the string. Because of this it is natural to view 
the graph underlying as a regular Z-gon. The group of automorphisms of this graph is 
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the set of all rotations, and the resulting field is homogeneous as defined in Section 2. 

Not only is each field pi homogeneous, but in addition, we tie features across fields for 
different values of I. Thus, the weight Aj of a feature is independent of I. To introduce a 
dependence on the length, as well as on whether or not a feature applies at the beginning 
or end of a string, we adopt the following artificial construction. We take as the graph 
of 0; an (7 + l)-gon rather than an /-gon, and label a distinguished vertex by the length, 
keeping this label held fixed. The graph for a 7-letter word is depicted in Figure 3. The 
dashed lines indicate edges that would need to be present for the field to be Markovian if 
each feature is supported by no more than three vertices. 



<7> 




m 



Figure 3 



To complete the description of the fields that are induced, we need to specify the set 
of atomic features. The atomic features that we allow fall into three types. The first type 
is the class of features of the form 



fvA") = { 



1 if uj v = c 
otherwise. 



where c is any ASCII character. The second type of atomic features involve the special 
vertex <l> that carries the length of the string. These are the features 



/»,fM = { 
fv,<>(u) = { 



1 if uj v = <l> 



otherwise 

1 if uj v = <l> for some I 
otherwise 



The atomic feature /„ <> introduces a dependence on whether a string of characters lies 
at the beginning or end of the string, and the atomic features f v j introduce a dependence 
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on the length of the string. To tie together the length dependence for long strings, we also 
introduce an atomic feature f v j+ for strings of length 7 or greater. 

The final type of atomic feature asks whether a character lies in one of three sets, 
[a-z] , [A-Z] , [0-9] , [@-&] , denoting arbitrary lowercase letters, uppercase letters, digits, 
and punctuation. For example, the atomic feature 

f ( u \ _ f 1 if G [a-z] 

•V[a-z]l ) \q otherwise 

tests whether or not a character is lowercase. 

To illustrate the notation that we use, let us suppose that the the following features 
are active for a field: "ends in ism," "a string of at least 7 characters beginning with a 
capital letter" and "contains ian." Then the probability of the word "Hamiltonianism" 
would be given as 

P;(14)p s (Hamiltonianism | | to \ = 14) = P,(14) e A 7+< [A-Z] +A ian+ A ism> _ 

Z14 

Here the A's are the parameters of the appropriate features, and we use the characters < 
and > to denote the beginning and ending of a string (more common regular expression 
notation would be " and $). The notation 7+< [A-Z] thus means "a string of at least 7 
characters that begins with a capital letter," corresponding to the feature 

fv,7+ /„, [A-Z] • 

Similarly, ism> means "ends in -ism" and corresponds to the feature 

fv,i f v , s f v , m fv,<> 
and ian means "contains ian," corresponding to the feature 

fv,i fv,a. fv,n ■ 



5.2 Description of the algorithm. We begin the random field induction algorithm with a 
model that assigns uniform probability to all word strings. We then incrementally add 
features to a random field model in order to minimize the Kullback-Leibler divergence 
between the field and the unigram distribution of the vocabulary obtained from a training 
corpus. The length distribution is taken according to the lengths of words in the empirical 
distribution of the training data. The improvement to the model made by a candidate 
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feature is evaluated by the reduction in relative entropy, with respect to the unigram 
distribution, that adding the new feature yields, keeping the other parameters of the model 
fixed. Our learning algorithm incrementally constructs a random field to describe those 
features of spellings that are most informative. 

At each stage in the induction algorithm, a set of candidate features is constructed. 
Because the fields are homogeneous, the set of candidate features can be viewed as follows. 
Each active feature can be expressed in the form 



where s is a string in the extended alphabet A of ASCII characters together with the 
macros [a-z] , [A-Z] , [0-9], [@-&] , and the length labels <1> and <>. If {/ s }sg,5 is the 
set of active features, (including s = e) using this representation, then the set of candidate 
features is precisely the set 



where a ■ s denotes concatenation of strings. As required by Definition 2, each such candi- 
date increases the support of an active feature by a single adjacent vertex. 

Since the model assigns probability to arbitrary word strings, the partition function 
Z[ can be computed exactly for only the smallest string lengths I. We therefore compute 
feature expectations using a random sampling algorithm. Specifically, we use the Gibbs 
sampler [25] to generate 10,000 spellings of random lengths. When computing the gain 
G q (g) of a candidate feature, we use these spellings to estimate the probability that the 
candidate feature g occurs k times in a spelling (see equation (3.7)-for example, the feature 
f v [ a - z ] occurs 2 times in the string The), and then solve for the corresponding (3 using 
Newton's method for each candidate feature. It should be emphasized that only a single 
set of random spellings needs to be generated; the same set can be used to estimate g^ for 
each candidate g. After adding the best candidate to the field, all of the feature weights are 
readjusted using the Improved Iterative Scaling algorithm. To carry out this algorithm, 
random spellings are again generated, this time incorporating the new feature, yielding 
Monte Carlo estimates of the coefficients a^\. Recall that is the expected number of 
times that feature i appears (under the substring representation for homogeneous features) 
in a string for which there is a total of m active features (see equation (4.11)). Given 
estimates for these coefficients, Newton's method is again used to solve equation (4.11), 
to complete a single iteration of the iterative scaling algorithm. After convergence of the 
Kullback-Leibler divergence, the inductive step is complete, and a new set of candidate 
features is considered. 




1 substring s appears in u 
otherwise 



a-si J s-a $ a£A,s£S 
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5.3 Sample results. We began with a uniform field, that is, a field with no features at all. 
For this field, all ASCII strings of a given length are equally likely, and the lengths are 
drawn from a fixed distribution. Here is a sample of strings drawn from this distribution: 

~, mo, _!ZP*@, m/TLL , ks;cm_3, *LQdR, D, aW{ , 5&TL|4, tc, ?!@, 
sNeiO+, wHo8zBr", pQlV, m, H!&, h9 , #0s, :, Ky}FM? , LW, ",8}, 
89Lj, -P, A, !, H, Y~:Du:, lxCl, l!'J#F*u., w=idHnM) , ~, 2, 
21eW2, I,bw~tkl, 3", ], ], b, +JEmj6, +E* , \qjqe"-7f, |al2, T, 
~(s0cl+2ADe, &, \p9oH, i;, $6, q}0+[, xEv, #U, 0)[83C0F, 
= |B|7°/ cR, Mqq, ?!mv, n=7G, $i9GAJ\, D, 5, , = , +u6@I9:, +, =D, 
2E#vz@3-, ~nu;.+s, 3xJ, GDWeqL , R,3R, !7v, FX,@y, 4p_cY2hU, ~ 

It comes as no surprise that the first feature the induction algorithm chooses is [a-z] ; 
it simply observes that characters should be lowercase. The maximum likelihood (maxi- 
mum entropy) weight for this feature is (3 = e x 6.99. This means that a string with a 
lowercase letter in some position is about 7 times more likely than the same string without 
a lowercase letter in that position. 

When we now draw strings from the new distribution (using annealing to concentrate 
the distribution on the more probable strings), we obtain spellings that are primarily made 
up of lowercase letters, but that certainly do not resemble English words: 

m, r, xevo, ijjiir, b, to, jz, gsr, wq, vf , x, ga, msmGh, 
pep, d, oziVlal, hzagh, yzop, io, advzmxnv, ijv_bolft, x, 
emx, kayerf , mlj , rawzyb, jp, ag, ctdnnnbg, wgdw, t, kguv, 
cy, spxcq, uzflbbf, dxtkkn, cxwx, jpd, ztzh, lv, zhpkvnu, 
1~, r, qee, nynrx, atze4n, ik, se, w, lrh, hp+, yrqyka'h, 
zengotenx, igcump, zjcjs, lqpWiqu, cefmfhc, o, lb, fdcY, tzby, 
yopxmvk, by, fz,, t, govyeem, ijyiduwfzo, 6xr, duh, ejv, pk, 
pjw, 1, fl, w 

In the following table we show the first 10 features that the algorithm induced, together 
with their associated parameters. Several things are worth noticing. The second feature 
chosen was [a-z] [a-z] , which denotes adjacent lowercase characters. The third feature 
added was the letter e, which is the most common letter. The weight for this feature is 
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(3 = e x = 3.47. The next feature introduces the first dependence on the length of the 
string: [a-z] >1 denotes the feature "a one character word ending with a lowercase letter." 
Notice that this feature has a small weight of 0.04, corresponding to our intuition that 
such words are uncommon. Similarly, the features z, q, j , and x are uncommon, and 
thus receive small weights. The appearance of the feature * is explained by the fact that 
the vocabulary for our corpus is restricted to the most frequent 100,000 spellings, and 
all other words receive the "unknown word" spelling ***, which is rather frequent. (The 
"end-of-sentence" marker, which makes its appearance later, is given the spelling |.) 



feature 
P 



[a-z] [a-z] [a-z] e [a-z]>l t * z q j x 
6.64 6.07 3.47 0.04 2.75 17.25 0.02 0.03 0.02 0.06 



Shown below is a collection of spellings obtained by Gibbs sampling from the resulting 
collection fields. 



frk, et , egeit, edet , eutdmeeet, ppge, A, dtgd, falawe, etci, 
eese, ye, epemtbn, tegoeed, ee, *mp, temou, enrteunt, ore, 
erveelew, heyu, rht , *, lkaeu, lutoee, tee, mmo , eobwtit, 
weethtw, 7, ee, teet, gre, /, *, eeeteetue, hgtte, om, he, *, 
stmenu, ec, ter, eedgtue, iu, ec, reett , *, ivtcmeee, vt , eets, 
tidpt, lttv, *, etttvti, ecte, X, see, *, pi, rlet, tt , *, eot , 
leef, ke, *, *, tet, iwteeiwbeie, yeee, et , etf, *, ov 

After inducing 100 features, the model finally begins to be concentrated on spellings 
that resemble actual spellings to some extent, particularly for short words. At this point 
the algorithm has discovered, for example, that the is a very common 3-letter word, that 
many words end in ed, and that long words often end in ion. A sample of 10 of the first 
100 features induced, with their appropriate weights is shown in the table below. 



feature 


,>1 3<the 


tion 


4<th 


y> 


ed> 


ion>7+ 


ent 


7+<c 


P 


22.36 31.30 11.05 


5.89 


4.78 


5.35 


4.20 


4.83 


5.17 


5.37 
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thed, and, thed, toftion, I , ieention, cention, I , ceetion, 
ant, is, seieeet, cinention, and, ., tloned, uointe, feredten, 
iined, sonention, inathed, other, the, id, and, ,, of, is, of, 
of, ,, leers, ,, ceeecion, ,, roferented, I, ioner, ,, I, the, 
the, the, centention, ionent, asers, ,, ctention, I, of, thed, 
of, uentie, of, and, ttentt, in, rerey, and, I, sotth, cheent , 
is, and, of, thed, rontion, that, seoftr 

A sample of the first 1000 features induced is shown in the table below, together with 
randomly generated spellings. Notice, for example, that the feature [0-9] [0-9] appears 
with a surprisingly high weight of 9382.93. This is due to the fact that if a string contains 
one digit, then it's very likely to contain two digits. But since digits are relatively rare 
in general, the feature [0-9] must have a small weight of 0.038. Also, according to the 
model, a lowercase letter followed by an uppercase letter is rare. 



feature 


s> 


<re 


ght> 


3<[A-Z] 


iy> 


al>7+ 


ing> 


P 


7.25 


4.05 


3.71 


2.27 


5.30 


94.19 


16.18 


feature 


[a-z] [A-Z] 


't> 


ed>7+ 


er>7+ 


ity 


ent>7+ 


[0-9] [0-9] 


P 


0.003 


138.56 


12.58 


8.11 


4.34 


6.12 


9382.93 


feature 


qu 


ex 


ae 


ment 


ies 


<wh 


ate 


P 


526.42 


5.265 


0.001 


10.83 


4.37 


5.26 


9.79 



was, reaser, in, there, to, will, ,, was, by, homes, thing, 
be, reloverated, ther, which, conists, at, fores, anditing, with, 
Mr., proveral , the, ,, ***, on't, prolling, prothere, ,, mento, 
at, yaou, 1, chestraing, for, have, to, intrally, of, qut , ., 
best, compers, ***, cluseliment, uster, of, is, deveral , this, 
thise, of, of feet, inatever, thifer, constranded, stater, vill, 
in, thase, in, youse, menttering, and, ., of, in, verate, of, to 

Finally, we visit the state of the model after growing 1500 features to describe words. 
At this point the model is making more refined judgements regarding what is to be consid- 
ered a word and what is not. The appearance of the features {}> and \ [@-&] {, is explained 
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by the fact that in preparing our corpus, certain characters were assigned special "macro" 
strings. For example, the punctuation characters $, _, %, and & are represented in our 
corpus as \${}, \_{}, Y/oO, and \&{}. As the following sampled spellings demonstrate, 
the model has at this point recognized the existence of macros, but has not yet discerned 
their proper use. 



feature 


7+<inte 


prov 


<der 


<wh 


19 


ons>7+ 


ugh 


ic> 


P 


4.23 


5.08 


0.03 


2.05 


2.59 


4.49 


5.84 


7.76 


feature 


sys 


ally 


7+<con 


ide 


nal 


{» 


qui 


\ [<§-&]{ 


P 


4.78 


6.10 


5.25 


4.39 


2.91 


120.56 


18.18 


913.22 


feature 


iz 


IB 


<inc 


<im 


iong 


$ 


ive>7+ 


<un 


P 


10.73 


10.85 


4.91 


5.01 


0.001 


16.49 


2.83 


9.08 



the, you, to, by, conthing, the, ., not, have, devened, been, 
of, |, F., ., in, have, -, ,, intering, ***, ation, said, 
prouned, ***, suparthere, in, mentter, prement , intever, you, 
., and, B., gover, producits, alase, not, conting, comment, 
but, |, that, of, is, are, by, from, here{}, incements, contive, 
., evined, agents, and, be, \.{}, thent , distements, all, — , 
has, will, said, resting, had, this, was, intevent, IBM, 
whree, acalinate, herned, are, ***, 0., I, 1980, but, will, 
***, is, ., to, becoment , ., with, recall, has, I, nother, 
ments, was, the, to, of, stounicallity , with, camanfined, 
in, this, intations, it, conanament , out, they, you 

While clearly the model still has much to learn, it has at this point compiled a significant 
collection of morphological observations, and has traveled a long way toward its goal of 
statistically characterizing English spellings. 

6. Extensions and Relations to Other Approaches 

In this section we briefly discuss some relations between our incremental feature induction 
algorithm for random fields and other statistical learning paradigms. We also present some 
possible extensions and improvements of our method. 
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6.1 Boltzmann machines. There is an immediate resemblence between the parameter es- 
timation problem for the random fields that we have considered and the learning problem 
for Boltzmann machines [1]. The classical Boltzmann machine is considered to be a ran- 
dom field on a graph G = (E,V) with configuration space = {0, 1}^ consisting of all 
labelings of the vertices by either a zero or a one. The machine is specified by a probability 
distribution on this configuration space of the form 



and the learning problem is to determine the set of weights \ij that best characterize a set 
of training samples. Typically only a subset of the vertices are labeled in the training set; 
the remaining vertices are considered to comprise the hidden units. Treated as a maximum 
likelihood problem, the training problem for Boltzmann machines becomes an instance of 
the general problem addressed by the EM algorithm [21], where iterative scaling is carried 
out in the M-step [12]. 

Most often the architecture of a Boltzmann machine is prescribed, and the learning 
problem is then solved by applying the EM algorithm (which typically involves random 
sampling and annealing). To cast Boltzmann machines into our framework, we can simply 
take binary- valued features of the form fij(uj) = UiUj. More generally, by allowing binary- 
valued features of the form 



for v = (y\ , . . . , v n ) a path in G = (E } V) } we construct models that are essentially "higher- 
order" Boltzmann machines [32]. With candidate features of this form our algorithm 
incrementally constructs a Boltzmann machine with no hidden units. If only a subset 
of the vertices are labelled in the training samples, then our Improved Iterative Scaling 
algorithm becomes an instance of the EM algorithm. 

6.2 Decision trees. Our feature induction method also bears some resemblence to various 
methods for growing decision trees. Like decision trees, our method builds a top-down 
classification that refines features. However, decision trees correspond to constructing 
features that have disjoint support. For example, binary decision trees are grown by 
splitting a mode n into two nodes by asking a binary question q n of the data at that node. 
Questions can be evaluated by the amount by which they reduce the entropy of the data 
at that node. This corresponds to our criterion of maximizing the reduction in entropy 
G q (g) over all candidate features g for a field q. When the decision tree has been grown 
to completion, each leaf I corresponds to a sequence of binary features 





fh flt-> fm-> ■■■■> /root 
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where w\ denotes the parent of node n, and with each feature f n being either the question q n 
or its negation ~^q n . Thus, each leaf / is characterized by the conjunction of these features, 
and different leaves correspond to conjunctions with disjoint support. In contrast, our 
feature induction algorithm generally results in features that have overlapping support. 

By modifying our induction algorithm in the following way, we obtain a direct gener- 
alization of binary decision trees. Instead of considering the 1-parameter family of fields 
q\^ g to determine the best candidate g = a A/, we consider the 2-parameter family of fields 
given by 

1 AaA/+A'(-.a)A/ 

q\,\', g = 7 e ^ > J . 

Since the features a A / and (~->a) A / have disjoint support, the improvement obtained by 
adding both of them is given by G q (a A /) + G q ((^a) A /). This procedure generalizes 
decision trees since the resulting features in the field can be overlapping. 

6.3 Dynamic Markov coding. Another technique that is similar in some aspects to random 
field induction is the dynamic Markov coding technique for text compression [14,5]. To 
incrementally build a finite state machine for generating strings in some output alphabet, 
dynamic Markov coding is based on the heuristic that the relative entropy of the finite- 
state machine might be lowered by giving a unique destination state to arcs that have 
high count. At each stage in the algorithm a state in the machine is split, or "cloned," 
into two states. The arc with the highest count coming into the original state is attached 
to one of the new states, and all of the remaining input arcs are attached to the other 
new state. As shown in [5], this technique is equivalent to incrementally building a finite- 
context model, adding a single output symbol s to a valid prefix a to form a new valid 
prefix s • a. In this way it is similar to our field induction algorithm which at each stage 
generates a new feature of the form a Af for some active feature /. However, because of the 
"on-line" nature of dynamic Markov coding, the technique is unable to precisely calculate 
the reduction in entropy due to splitting a state, and must instead rely on more primitive 
heuristics. A closely related technique is given in [31]. In [33] a method for building hidden 
Markov models is presented which is in some sense the opposite approach, in that it starts 
with a maximally detailed finite-state model and proceeds by incrementally generalizing 
by merging states according to a greedy algorithm. 

6.4 Conditional exponential models. Almost all of what we have presented here carries over 
to the more general setting of conditional exponential models, including the Improved Iter- 
ative Scaling algorithm presented here. For general conditional distributions p(y | x) there 
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may be no underlying random field, but with features defined as binary functions f(x, y), 
the same general approach is applicable. The feature induction method for conditional 
exponential models is demonstrated for several problems in statistical machine translation 
in [6], where it is presented in terms of the principle of maximum entropy. 

6.5 Extensions. The random field induction method presented in this paper is not defini- 
tive; there are many possible variations on the basic theme, which is to incrementally 
construct an increasingly detailed field to approximate the reference distribution p. Be- 
cause the basic technique is based on a greedy algorithm, there are of course many ways 
for improving the search for a good set of features. The algorithm presented in Section 2 
is in some respects the most simple possible within the general framework. But it also 
computationally intensive. A natural modification would be to add several of the top 
candidates at each stage. While this should increase the overall speed of the induction 
algorithm, it would also potentially result in more redundancy among the features, since 
the top candidates could be correlated. Another modification of the algorithm would be to 
add only the best candidate at each step, but then to carry out parameter estimation only 
after several new features had been added to the field. It would also be natural to establish 
a more Bayesian framework in which a prior distribution on features and parameters is 
incorporated. This could enable a principled approach for deciding when the feature in- 
duction is complete, by evaluating the posterior distribution of the field given the training 
samples. 

As mentioned above, the method presented here does not explicitly learn any hidden 
structure, and thus does not generalize as much as would be desirable for many appli- 
cations. One possibility would be to combine our method with a merging technique for 
combining features in order to generalize from a more detailed set of observations. While 
in principle our learning method can be carried out in the presence of incomplete data (in 
which case iterative scaling of the parameters can be viewed as an EM algorithm), we have 
not investigated searching methods for revealing hidden structure. This is a promising 
direction for future research. 
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